NIXL EP: Use VMM API for device memory allocation. #1415
itayalroy merged 32 commits into ai-dynamo:main from
Conversation
👋 Hi ofirfarjun7! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.

Walkthrough: Adds CUDA Driver VMM-backed allocation support.

Sequence diagram (summary): App->>Driver: vmm_init(size, device)

Estimated code review effort: 3 (Moderate), ~20 minutes.
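The `vmm_init` call named in the sequence diagram maps onto the standard CUDA Driver VMM sequence (cuMemCreate → cuMemAddressReserve → cuMemMap → cuMemSetAccess). The sketch below is illustrative, not the PR's actual code; only the `vmm_init` name comes from the walkthrough, and the `DRV_CHECK` helper is an assumption:

```cpp
#include <cuda.h>
#include <stdexcept>

// Illustrative error-check helper (not from the PR).
#define DRV_CHECK(call)                                        \
    do {                                                       \
        CUresult r_ = (call);                                  \
        if (r_ != CUDA_SUCCESS)                                \
            throw std::runtime_error(#call " failed");         \
    } while (0)

// Sketch of a VMM-backed device allocation bound to one device.
CUdeviceptr vmm_init(size_t size, CUdevice device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = (int)device;
    // Fabric handles are what make the allocation exportable over
    // multi-node NVLink.
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

    size_t gran = 0;
    DRV_CHECK(cuMemGetAllocationGranularity(
        &gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    size_t padded = ((size + gran - 1) / gran) * gran;  // round up to granularity

    CUmemGenericAllocationHandle handle;
    DRV_CHECK(cuMemCreate(&handle, padded, &prop, 0));

    CUdeviceptr va = 0;
    DRV_CHECK(cuMemAddressReserve(&va, padded, gran, 0, 0));
    DRV_CHECK(cuMemMap(va, padded, 0, handle, 0));

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    DRV_CHECK(cuMemSetAccess(va, padded, &access, 1));
    return va;
}
```

A real implementation would also fall back to a non-fabric handle type when the device reports no fabric support, which is exactly the non-MNNVL concern raised later in this thread.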
Pre-merge checks: 2 passed, 1 failed (1 warning).
Actionable comments posted: 1
Inline comments:
In `@examples/device/ep/csrc/nixl_ep.hpp`:
- Around line 66-77: The two calls to cuDeviceGetAttribute (checking
CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED and
CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED) do not check their CUresult
return values; update the code around the variables rdma_vmm_supported and
fabric_supported to capture the CUresult, test it against CUDA_SUCCESS, and on
failure throw or log a runtime_error that includes the cuGetErrorString result
and context (which attribute failed and for which device); ensure you only rely
on rdma_vmm_supported/fabric_supported after the call succeeds so you don't act
on zero-initialized values.
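One way to apply this fix is a checked wrapper around the attribute query; this is a sketch, and the `get_device_attribute_checked` helper name is an assumption, not the PR's code:

```cpp
#include <cuda.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: query a device attribute and fail loudly if the
// driver call itself fails, so callers never act on a zero-initialized value.
static int get_device_attribute_checked(CUdevice_attribute attr,
                                        CUdevice dev,
                                        const char *attr_name) {
    int value = 0;
    CUresult res = cuDeviceGetAttribute(&value, attr, dev);
    if (res != CUDA_SUCCESS) {
        const char *err = nullptr;
        cuGetErrorString(res, &err);
        throw std::runtime_error(
            std::string("cuDeviceGetAttribute(") + attr_name +
            ") failed for device " + std::to_string(dev) + ": " +
            (err ? err : "unknown error"));
    }
    return value;
}

// Usage at the site the review points at:
// int rdma_vmm_supported = get_device_attribute_checked(
//     CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED, dev,
//     "GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED");
// int fabric_supported = get_device_attribute_checked(
//     CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev,
//     "HANDLE_TYPE_FABRIC_SUPPORTED");
```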
Files reviewed (2): examples/device/ep/csrc/nixl_ep.cpp, examples/device/ep/csrc/nixl_ep.hpp
Actionable comments posted: 2
Inline comments:
In `@examples/device/ep/csrc/nixl_ep.hpp`:
- Around line 102-110: The destructor ~cuda_allocator() currently unmaps and
releases VMM state without waiting for GPU work; fix by calling
cudaDeviceSynchronize() at the start of ~cuda_allocator() before any
cuMemUnmap/cuMemAddressFree/cuMemRelease calls so all in-flight
kernels/transfers are fenced; additionally, ensure allocator creation paths
cannot bypass that fence on exception by either making explicitly_destroy
default to false or wrapping allocator construction in the init paths
(_nixl_agent_init(), _nixl_ep_init(), or any init() that creates the allocator)
with a try/catch that calls cudaDeviceSynchronize() before rethrowing so the
stack-unwound destructor runs safely; keep references to the methods destroy() and
~cuda_allocator() when making changes.
- Around line 56-64: The allocator currently queries the ambient CUDA context
via cuCtxGetDevice() which is unsafe; change cuda_allocator to accept an
explicit CUdevice (or device_id) parameter, set the context explicitly inside
the constructor using cuCtxSetCurrent() and handle errors (throw on failure),
then update Buffer::init call sites to pass the tracked device_id into
cuda_allocator so allocations are bound to the correct device regardless of
external context switches; ensure you remove the cuCtxGetDevice() usage in
cuda_allocator and propagate failures with clear runtime_error messages.
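The destructor ordering the first comment asks for can be sketched as follows; the member names (`va_base_`, `va_size_`, `handle_`) are assumptions, not the actual fields:

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>

// Sketch: fence all in-flight GPU work before tearing down the VMM state.
cuda_allocator::~cuda_allocator() {
    // Fence first: no kernel or transfer may still touch the mapping below.
    // Destructors must not throw, so only log on failure here.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed in ~cuda_allocator: %s\n",
                cudaGetErrorString(err));
    }
    cuMemUnmap(va_base_, va_size_);     // unmap before releasing the handle
    cuMemRelease(handle_);
    cuMemAddressFree(va_base_, va_size_);
}
```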
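The second comment's device-binding fix might look like the sketch below; using the primary context is one possible choice, and everything beyond the explicit `CUdevice` parameter is an assumption:

```cpp
#include <cuda.h>
#include <stdexcept>

// Sketch: bind the allocator to an explicit device instead of whatever
// context happens to be current when the caller constructs it.
cuda_allocator::cuda_allocator(CUdevice device) : device_(device) {
    CUcontext ctx = nullptr;
    // Retain the device's primary context and make it current so every
    // subsequent cuMem* call is bound to this device.
    CUresult res = cuDevicePrimaryCtxRetain(&ctx, device_);
    if (res != CUDA_SUCCESS)
        throw std::runtime_error("cuDevicePrimaryCtxRetain failed");
    res = cuCtxSetCurrent(ctx);
    if (res != CUDA_SUCCESS)
        throw std::runtime_error("cuCtxSetCurrent failed");
}
```

With this shape, the Buffer::init call sites would pass the tracked device_id through (e.g. constructing the allocator with that ordinal), and a matching cuDevicePrimaryCtxRelease would belong in the destructor.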
Files reviewed (1): examples/device/ep/csrc/nixl_ep.hpp
/build
Actionable comments posted: 1
Inline comments:
In `@examples/device/ep/csrc/nixl_ep.cpp`:
- Around line 70-127: The static cuda_alloc_ctx ctx binds device-specific fields
(prop.location.id, granularity, fabric support) to whichever device was active
at first cuCtxGetDevice() call, causing wrong-device VMM allocations; change
initialization so driver/version checks remain global but device-specific
queries (cuCtxGetDevice(), cuDeviceGetAttribute(),
cuMemGetAllocationGranularity()) are performed per-call or cached per-device
(keyed by device ID) instead of in the static cuda_alloc_ctx constructor—either
remove static cuda_alloc_ctx ctx and build a ctx per vmm_init()/allocation (with
a device-ID cache), or split cuda_alloc_ctx into a static global verifier and a
per-device struct populated on each allocation using the current context.
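The per-device cache this comment suggests could be sketched as below; driver-wide checks stay global while granularity and fabric support are looked up per device ID. Struct and function names are illustrative, not the PR's:

```cpp
#include <cuda.h>
#include <mutex>
#include <unordered_map>

// Sketch: device-specific VMM properties, cached per device ID so the
// first-caller's device no longer poisons later allocations.
struct per_device_info {
    size_t granularity = 0;
    int fabric_supported = 0;
};

static per_device_info &get_device_info(CUdevice dev) {
    static std::mutex mtx;
    static std::unordered_map<int, per_device_info> cache;  // keyed by device ID
    std::lock_guard<std::mutex> lock(mtx);
    auto it = cache.find((int)dev);
    if (it != cache.end())
        return it->second;

    per_device_info info;
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = (int)dev;
    cuMemGetAllocationGranularity(&info.granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    cuDeviceGetAttribute(&info.fabric_supported,
                         CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
    return cache.emplace((int)dev, info).first->second;
}
```

Each allocation path would then call get_device_info with the device from the current context (or an explicitly passed ordinal) instead of reading fields from a static constructor-time context.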
Files reviewed (2): examples/device/ep/csrc/nixl_ep.cpp, examples/device/ep/csrc/nixl_ep.hpp

/build
/ok to test b1cab66

/build
itayalroy left a comment:
LGTM, just a couple of questions:
- How was it tested? I think we should make sure it does not break non-MNNVL systems
- Do we have any perf numbers?
- Do we see the expected improvement on MNNVL systems?
- Is there a degradation on non-MNNVL systems?
I tested it on DFW & Lyris (which supports MNNVL); on Lyris both single-node and multi-node.
Perf comparison (the result tables did not survive capture; configurations below):
- DFW: 2 nodes, 8 procs per node, 16 experts, 256 tokens, topk 8 — main with NVLink vs. PR with NVLink.
- Lyris: 2 nodes, 4 procs per node, 16 experts, 128 tokens, topk 8 — main with multi-node NVLink vs. PR with multi-node NVLink.
What?
Use VMM API for device memory allocation in nixl_ep.
Why?
To support multi-node NVLink.
How?